Diagnosis of Email Spams – Some Statistical Considerations

Author

  • K. Srikanth
Abstract

While email is one of the fastest forms of communication, users frequently receive unsolicited emails, called spams. Non-spam mails, known as hams, are legitimate mails. In practice it is very difficult to classify a mail perfectly as spam or ham based on its content or subject. Several statistical methods are available that classify mails with some chance of misclassification. The most popular is the Bayesian approach, which uses the conditional probability of occurrence of given words in the spam and ham groups of the training data. Most content-based classifiers rely on word tokenization, leading to a large corpus of words along with their probabilities of occurrence. In this paper, we discuss some statistical properties of data sets used as corpora for training classifiers.

I. THE NATURE OF SPAM MAILS

Email has become an efficient and popular communication medium as the number of internet users has increased over the last ten years. Email management is therefore an important and growing problem for individuals and organizations, because email is prone to misuse. Email spam is unsolicited, unwanted email sent indiscriminately, directly or indirectly, by a sender having no relationship with the recipient. Email spam has grown steadily since the 1990s. According to Rebecca Lieb (2002), botnets, that is, networks of virus-infected computers, are used to send about 80% of spam. A significant amount of time and resources is wasted in examining and deleting spams, and the cost is borne by the recipient. Spammers are the people who send unsolicited mails to different users. Spammers collect email addresses from chat rooms, websites and customer lists and by harvesting users' address books, and sell these addresses to other spammers.

Characteristics of Spam mails

Filtering of spam emails is an important task for email providers as well as individuals. We need to be able to distinguish spam from legitimate mails.
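The Bayesian approach mentioned in the abstract can be illustrated with a minimal sketch. The toy training messages, the function names, and the use of Laplace (add-one) smoothing are assumptions made for illustration only; they are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages):
    """Count word occurrences per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, class_counts


def log_posterior(text, label, word_counts, class_counts):
    """Unnormalized log P(label | text), treating words as independent."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_words = sum(word_counts[label].values())
    # Log prior: share of training messages carrying this label.
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in tokenize(text):
        # Laplace (add-one) smoothing, an assumption of this sketch,
        # keeps unseen words from producing a zero probability.
        p = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p)
    return score


def classify(text, word_counts, class_counts):
    """Return the label with the larger log posterior."""
    return max(
        ("spam", "ham"),
        key=lambda label: log_posterior(text, label, word_counts, class_counts),
    )


# Invented toy corpus; a real classifier needs a large labelled corpus.
TRAINING = [
    ("free money click here act now", "spam"),
    ("lose weight risk free offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
]

word_counts, class_counts = train(TRAINING)
print(classify("click here for free money", word_counts, class_counts))
print(classify("agenda for the monday meeting", word_counts, class_counts))
```

Working in log probabilities avoids numerical underflow when many small word probabilities are multiplied. With a corpus this small any message can be misclassified, which matches the point that such methods classify mails only with some chance of misclassification.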
To do this, we need to identify typical spam characteristics and set up filters to block these spam messages. Individuals can define their own filters to block unsolicited mails. Spammers continuously refine their tactics and create disturbances for internet users, so it is important to update spam filters from time to time to keep spam blocked. Spam characteristics generally appear in two parts of a message: the email headers and the message contents.

Email headers indicate the route by which the mail reaches its destination. They also carry information such as sender and recipient addresses, message ID, date and time of transmission, subject and other email characteristics. Most spammers try to hide their identity by forging email headers to conceal the real source of the message. Spammers use mass mailing to send mails to a large number of recipients.

Message contents of spam use certain language that companies rely on to distinguish spam messages from others. Spam messages use words and phrases like free, click here, act now, risk free, lose weight and earn money, along with exclamation marks (!) and capital letters, to attract the attention of the recipient. Many spam emails originate from web ads. According to the Commtouch report (2010), 183 billion spam mails were sent daily to internet users. Among them, the most common category is Pharmacy ads (81%), followed by Replica (5.40%), Enhancers (2.30%), Phishing (2.30%), Degrees (1.30%) and Casino (1%).

II. STATISTICS OF EMAIL SPAM

Spam volumes are increasing day by day, and the spam that internet users see in their mailboxes is only a portion of the total spam sent. According to Josh Halliday (2011), the volume of spam messages was estimated at around 200 billion sent per day. A 2010 survey of email users in Europe and the US showed that, despite knowing the risks of opening spam mails, 46% of users still opened them, putting their computers at risk.
According to the year-wise reports, there were 2.4 billion spam mails per day at the start of 2002, and the figure reached 11 billion per day in 2004. By mid-June 2007 it had risen to 100 billion spam mails per day, and in January 2010 it reached 183 billion per day. According to Steve Ballmer (2004), Microsoft founder Bill Gates receives four million emails per year, most of them spam, while Jef Poskanzer (2006), owner of the domain name “acme.com”, was receiving over one million spam emails per day. Sophos (2008) has reported the countries that are major sources of spam, as given in Table-1.

Table-1: Country-wise spam statistics
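The year-wise figures quoted in Section II imply a compound annual growth rate, which the short sketch below works out. The fractional years assigned to each figure (for example 2007.5 for mid-June 2007, and the start of the year for 2004) and the function name are assumptions of this sketch, not values stated in the paper.

```python
# Spam volumes quoted in the text, in billions of messages per day.
# The (approximate) year attached to each figure is an assumption.
VOLUMES = [
    (2002.0, 2.4),    # start of 2002
    (2004.0, 11.0),   # "in the year 2004", taken here as the start of the year
    (2007.5, 100.0),  # mid-June 2007, rounded to mid-year
    (2010.0, 183.0),  # January 2010
]


def annual_growth(start, end):
    """Compound annual growth rate between two (year, volume) points."""
    (t0, v0), (t1, v1) = start, end
    return (v1 / v0) ** (1.0 / (t1 - t0)) - 1.0


overall = annual_growth(VOLUMES[0], VOLUMES[-1])
print(f"2002 to 2010: {overall:.0%} per year")

# Growth rate over each consecutive pair of reported figures.
for start, end in zip(VOLUMES, VOLUMES[1:]):
    rate = annual_growth(start, end)
    print(f"{start[0]:.1f} to {end[0]:.1f}: {rate:.0%} per year")
```

Under these assumptions the overall rate comes to roughly 72% per year, and the per-interval rates indicate that growth was fastest in the earliest interval and slowest in the latest.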


Similar resources

Exploiting Latent Content based Features for the Detection of Static SMS Spams

As the use of mobile phones grows, spams are becoming increasingly common in mobile communication such as SMS, calling for research on SMS spam detection. Existing detection techniques for SMS spams have been mostly adapted from those developed for other contexts such as emails and the web without taking into account some unique characteristics of SMS. Additionally, spamming tactics is constant...

Correlations and Omori law in Spamming

The most costly and annoying characteristic of the e-mail communication system is the large number of unsolicited commercial e-mails, known as spams, that are continuously received. Via the investigation of the statistical properties of the spam delivering intertimes, we show that spams delivered to a given recipient are time correlated: if the intertime between two consecutive spams is small (...

Mitigating the Impact of Spams by Internet Content Pollution

In recent years, there has been a steep rise in the amount of unsolicited-emails (spams) [11]. Such mails overwhelm users’ mailboxes, consume server resources and cause delays to mail delivery. Many techniques [2, 10, 12, 5, 13] have been used for mitigating spams. Despite the plethora of schemes proposed, all of them have the cardinal problem of false positives which compromises the reliabilit...

An Overview of Content-Based Spam Filtering Techniques

So fast, so cheap, so efficient, Internet is nowadays incontestably communication mean of choice for personal, business and academic purposes. Unfortunately, Internet has not only this beautiful face. Malicious activities enjoy as well this so fast, cheap and efficient mean. The last decade, Internet worms took the lights. In the recent years, spams are invading one of the most used services of...

A New Hybrid Approach of K-Nearest Neighbors Algorithm with Particle Swarm Optimization for E-Mail Spam Detection

Emails are one of the fastest economic communications. Increasing email users has caused the increase of spam in recent years. As we know, spam not only damages user’s profits, time-consuming and bandwidth, but also has become as a risk to efficiency, reliability, and security of a network. Spam developers are always trying to find ways to escape the existing filters therefore new filters to de...


Publication date: 2012